Overview

Background

During the summer of 2012, wildfires ravaged the Algerian territory, covering most of the northern part of the country, especially the coastal cities. The disaster was driven by higher-than-average temperatures, which reached as high as 50 degrees Celsius.

Objectives

One important measure against the recurrence of such disasters is the ability to predict their occurrence. In this project, we therefore attempt to predict these forest fires from multiple features related to weather indices.

Dataset Description

The dataset we will use to train and test our models consists of 244 observations from two Algerian wilayas (provinces): Sidi-Bel Abbes and Bejaia. The observations were gathered over four months, from June to September 2012, for both regions.

The dataset contains the following variables:

  1. Date: (DD/MM/YYYY) day, month (June to September), and year (2012)
  2. Temp: maximum temperature at noon, in degrees Celsius: 22 to 42
  3. RH: relative humidity, in %: 21 to 90
  4. Ws: wind speed, in km/h: 6 to 29
  5. Rain: total daily rainfall, in mm: 0 to 16.8
    FWI Components (check this LINK for more information)
  6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
  7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
  8. Drought Code (DC) index from the FWI system: 7 to 220.4
  9. Initial Spread Index (ISI) index from the FWI system: 0 to 18.5
  10. Build-up Index (BUI) index from the FWI system: 1.1 to 68
  11. Fire Weather Index (FWI) Index: 0 to 31.1
  12. Classes: two classes, namely “fire” and “not fire”

Exploratory Analysis

We first start off by importing the necessary libraries for our analysis.

Judging from the functions used throughout the analysis, these include plyr (mapvalues()), dplyr and tidyr (full_join(), drop_na(), and the pipe), vtable (st() summary tables), ggplot2, ggcorrplot, and plotly for visualization, caret (with klaR as the backend for stepLDA/stepQDA) for feature selection and model training, and caTools (sample.split()).
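A minimal setup sketch consistent with the calls made later in the analysis (the exact set of packages is inferred from the code rather than stated in the text):

```r
library(plyr)       # mapvalues(); loaded before dplyr to avoid masking issues
library(dplyr)      # full_join(), %>%
library(tidyr)      # drop_na()
library(vtable)     # st() summary tables
library(ggplot2)    # plotting
library(ggcorrplot) # ggcorrplot(), cor_pmat()
library(plotly)     # ggplotly() interactive plots
library(caret)      # train(), varImp(), trainControl()
library(klaR)       # backend for the stepLDA / stepQDA methods
library(caTools)    # sample.split()
```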

Importing the data

The dataset provided to us was a single .csv file containing two tables: one with the observations from the Sidi-Bel Abbes region, and the other with those from Bejaia.

Before starting our analysis, we separated the tables into two distinct files by region, named Algerian_forest_fires_dataset_Bejaia.csv and Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv for Bejaia and Sidi-Bel Abbes respectively.
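Reading the two files back in can be sketched as follows (the dataframe names df_b and df_s match those used in the rest of the analysis):

```r
# Load each region's observations from the files created above
df_b <- read.csv("Algerian_forest_fires_dataset_Bejaia.csv")
df_s <- read.csv("Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv")
```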

Cleaning and processing the data

We first check for null values in the dataset; none were found.

colSums(is.na(df_b))
        day       month        year Temperature          RH 
          0           0           0           0           0 
         Ws        Rain        FFMC         DMC          DC 
          0           0           0           0           0 
        ISI         BUI         FWI     Classes 
          0           0           0           0 
colSums(is.na(df_s))
        day       month        year Temperature          RH 
          0           0           0           0           0 
         Ws        Rain        FFMC         DMC          DC 
          0           0           0           0           0 
        ISI         BUI         FWI     Classes 
          0           0           0           0 

We then proceed to add a column to both datasets indicating the region (wilaya) of each observation. We chose the following encoding:

  1. Bejaia = 0
  2. Sidi-Bel Abbes = 1
df_b[["Region"]] = 0
df_s[["Region"]] = 1

After that, we merge our two datasets into a single dataframe using full_join(); this will allow us to explore and analyze the data more easily. Before joining, the DC and FWI columns of the Sidi-Bel Abbes table must be coerced to numeric so that the column types match across the two tables.

df_s$DC <- as.double(df_s$DC)
Warning: NAs introduced by coercion
df_s$FWI <- as.double(df_s$FWI)
Warning: NAs introduced by coercion
df = full_join(df_s, df_b)
Joining, by = c("day", "month", "year", "Temperature", "RH", "Ws", "Rain", "FFMC", "DMC", "DC", "ISI", "BUI", "FWI", "Classes", "Region")
dim(df)
[1] 244  15
str(df)
'data.frame':   244 obs. of  15 variables:
 $ day        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month      : int  6 6 6 6 6 6 6 6 6 6 ...
 $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Temperature: int  32 30 29 30 32 35 35 28 27 30 ...
 $ RH         : int  71 73 80 64 60 54 44 51 59 41 ...
 $ Ws         : int  12 13 14 14 14 11 17 17 18 15 ...
 $ Rain       : num  0.7 4 2 0 0.2 0.1 0.2 1.3 0.1 0 ...
 $ FFMC       : num  57.1 55.7 48.7 79.4 77.1 83.7 85.6 71.4 78.1 89.4 ...
 $ DMC        : num  2.5 2.7 2.2 5.2 6 8.4 9.9 7.7 8.5 13.3 ...
 $ DC         : num  8.2 7.8 7.6 15.4 17.6 26.3 28.9 7.4 14.7 22.5 ...
 $ ISI        : num  0.6 0.6 0.3 2.2 1.8 3.1 5.4 1.5 2.4 8.4 ...
 $ BUI        : num  2.8 2.9 2.6 5.6 6.5 9.3 10.7 7.3 8.3 13.1 ...
 $ FWI        : num  0.2 0.2 0.1 1 0.9 3.1 6 0.8 1.9 10 ...
 $ Classes    : chr  "not fire   " "not fire   " "not fire   " "not fire   " ...
 $ Region     : num  1 1 1 1 1 1 1 1 1 1 ...
summary(df)
      day            month          year       Temperature   
 Min.   : 1.00   Min.   :6.0   Min.   :2012   Min.   :22.00  
 1st Qu.: 8.00   1st Qu.:7.0   1st Qu.:2012   1st Qu.:30.00  
 Median :16.00   Median :7.5   Median :2012   Median :32.00  
 Mean   :15.75   Mean   :7.5   Mean   :2012   Mean   :32.17  
 3rd Qu.:23.00   3rd Qu.:8.0   3rd Qu.:2012   3rd Qu.:35.00  
 Max.   :31.00   Max.   :9.0   Max.   :2012   Max.   :42.00  
                                                             
       RH              Ws            Rain              FFMC      
 Min.   :21.00   Min.   : 6.0   Min.   : 0.0000   Min.   :28.60  
 1st Qu.:52.00   1st Qu.:14.0   1st Qu.: 0.0000   1st Qu.:72.08  
 Median :63.00   Median :15.0   Median : 0.0000   Median :83.50  
 Mean   :61.94   Mean   :15.5   Mean   : 0.7607   Mean   :77.89  
 3rd Qu.:73.25   3rd Qu.:17.0   3rd Qu.: 0.5000   3rd Qu.:88.30  
 Max.   :90.00   Max.   :29.0   Max.   :16.8000   Max.   :96.00  
                                                                 
      DMC              DC              ISI              BUI       
 Min.   : 0.70   Min.   :  6.90   Min.   : 0.000   Min.   : 1.10  
 1st Qu.: 5.80   1st Qu.: 12.35   1st Qu.: 1.400   1st Qu.: 6.00  
 Median :11.30   Median : 33.10   Median : 3.500   Median :12.25  
 Mean   :14.67   Mean   : 49.43   Mean   : 4.774   Mean   :16.66  
 3rd Qu.:20.75   3rd Qu.: 69.10   3rd Qu.: 7.300   3rd Qu.:22.52  
 Max.   :65.90   Max.   :220.40   Max.   :19.000   Max.   :68.00  
                 NA's   :1                                        
      FWI           Classes              Region   
 Min.   : 0.000   Length:244         Min.   :0.0  
 1st Qu.: 0.700   Class :character   1st Qu.:0.0  
 Median : 4.200   Mode  :character   Median :0.5  
 Mean   : 7.035                      Mean   :0.5  
 3rd Qu.:11.450                      3rd Qu.:1.0  
 Max.   :31.100                      Max.   :1.0  
 NA's   :1                                        
unique(df$year)
[1] 2012
unique(df$month)
[1] 6 7 8 9

We check again for NA values that might have been introduced while preparing and merging the two tables: the summary above reports one NA in each of DC and FWI, introduced by the type coercion. We delete that row, since losing a single observation will not meaningfully affect the dataset.

colSums(is.na(df))
        day       month        year Temperature          RH 
          0           0           0           0           0 
         Ws        Rain        FFMC         DMC          DC 
          0           0           0           0           0 
        ISI         BUI         FWI     Classes      Region 
          0           0           0           0           0 
df = df %>% drop_na(DC)
dim(df)
[1] 243  15

We now inspect the distinct values taken by the categorical variables, mainly the Classes and Region columns.

unique(df$Classes)
[1] "not fire   "   "fire   "       "not fire     " "not fire    " 
[5] "fire"          "fire "         "not fire"      "not fire "    
unique(df$Region)
[1] 1 0

We find that the Classes column contains values with extraneous whitespace, so we trim those spaces.

df$Classes <- trimws(df$Classes, which = c("both"))
unique(df$Classes)
[1] "not fire" "fire"    
df = df %>% drop_na(Classes)
df$Classes <- mapvalues(df$Classes, from=c("not fire","fire"), to=c(0,1))
unique(df$Classes)
[1] "0" "1"
df$Classes <- as.numeric(df$Classes)
st(df)

Since year is constant (2012) across all observations, it carries no information, so we drop it.

df <- df[-c(3)]

We then create a scaled copy of the data, standardizing the numeric predictors while leaving day, month, Classes, and Region (columns 1, 2, 13, and 14) untouched.

df_scaled = df
df_scaled[-c(1,2,13,14)] <- scale(df[-c(1,2,13,14)])
st(df_scaled)

Visualizing the data

We have ended up with a clean and scaled dataframe named df_scaled, which we will use to visualize and further explore our data.

Our first instinct is to compare the two regions in terms of the number of fires and the average temperature.

aggregate(df$Classes ~ df$Region, FUN = sum)
aggregate(df$Temperature ~ df$Region, FUN = mean)

We use the unscaled dataset so that the plot shows real-life temperature values.

df %>%
  group_by(Region) %>%
  summarise(Number_of_fires = sum(Classes), Temperature = mean(Temperature)) %>%
  ggplot(aes(x = Region, y = Number_of_fires, fill = Temperature)) +
  geom_col(position = 'dodge')

We can see that the Sidi-Bel Abbes region had both a greater total number of fires and a higher average temperature throughout the summer of 2012.

Model Building

Correlation Matrix

The previous results lead us to suspect a positive relationship between temperature and the likelihood of a fire. However, we need to investigate the other variables as well, so we plot a correlation matrix of the features in the dataset.

corr_mat <- round(cor(df_scaled), 2)
p_mat <- cor_pmat(df_scaled)
 
corr_plot <- ggcorrplot(
  corr_mat,
  hc.order = FALSE,
  type = "upper",
  outline.col = "white"
)
 
ggplotly(corr_plot)

Feature Selection

We performed feature selection using the caret package to determine which features are the most important and which are the least.

In this case, we opted for Linear Discriminant Analysis with Stepwise Feature Selection by specifying stepLDA as our method.

The varImp function returns an importance score for each feature (by default scaled to a 0 to 100 range). According to the official caret documentation, for classification the metric is calculated by conducting a ROC curve analysis on each predictor: a series of cutoffs is applied to the predictor data to predict the class, and the resulting AUC is used as the measure of variable importance.

# prepare training scheme
set.seed(7)
df_scaled$Classes = as.factor(df_scaled$Classes)

control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
modelLDA <- train(Classes~., data=df_scaled, method="stepLDA", trControl=control)
 `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
218 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.9039;  in: "ISI";  variables (1): ISI 

 hr.elapsed min.elapsed sec.elapsed 
      0.000       0.000       0.472 
(The stepwise log above is repeated for each of the 30 cross-validation resamples, i.e. 10 folds with 3 repeats; every resample selects ISI first, with a correctness rate around 0.90, followed by FFMC, around 0.96. The output below shows the final fit on all 243 observations.)
 `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
243 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.9055;  in: "ISI";  variables (1): ISI 
correctness rate: 0.96283;  in: "FFMC";  variables (2): ISI, FFMC 

 hr.elapsed min.elapsed sec.elapsed 
      0.000       0.000       0.752 
modelQDA <- train(Classes~., data=df_scaled, method="stepQDA", trControl=control)
 `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
218 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.9816;  in: "FFMC";  variables (1): FFMC 

 hr.elapsed min.elapsed sec.elapsed 
      0.000       0.000       0.537 
(The stepwise log above is repeated for each of the 30 cross-validation resamples; each retains a single variable, most often FFMC and occasionally ISI, with correctness rates around 0.97 to 0.98. The output below shows the final fit on all 243 observations.)
 `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
243 observations of 13 variables in 2 classes; direction: both
stop criterion: improvement less than 5%.
correctness rate: 0.9755;  in: "FFMC";  variables (1): FFMC 

 hr.elapsed min.elapsed sec.elapsed 
      0.000       0.000       0.341 
importanceLDA <- varImp(modelLDA, scale=FALSE)
importanceQDA <- varImp(modelQDA, scale=FALSE)

plot(importanceLDA)

plot(importanceQDA)

We can see that the variables month, Ws, Region, and day are insignificant compared to other features. We will disregard them in our model.

Logistic Regression

Splitting the dataset

We first start by performing Logistic Regression on our dataset. We begin by splitting the data into train/test sets with a 80/20 split. This split was chosen by default as a good practice.

split <- sample.split(df_scaled, SplitRatio=0.8)

train_set <- subset(df_scaled, split == "TRUE")
test_set <- subset(df_scaled, split=="FALSE")
head(train_set)

Training the model

We create our model with the features that were the most important during our feature selection step. Then, we fit the model to our training data.

After that, we test our model on the test set and use a threshold of 0.5 to set our predictions. This will result in having only one False Positive and one False Negative prediction. By the end, our model has reached an accuracy of 94% on our test data.

summary(logistic_model)

Call:
glm(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC + 
    ISI + BUI + FWI + RH, family = "binomial", data = train_set)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-2.489e-04  -2.100e-08   2.100e-08   2.100e-08   2.684e-04  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)     86.43   45825.71   0.002    0.998
Temperature    -74.96    8341.92  -0.009    0.993
Rain           152.35   22025.25   0.007    0.994
FFMC           415.65   82591.55   0.005    0.996
DMC            -81.94   25166.62  -0.003    0.997
DC             133.93   17053.69   0.008    0.994
ISI            489.64  139385.17   0.004    0.997
BUI             53.22   32538.27   0.002    0.999
FWI           -184.61  144165.97  -0.001    0.999
RH              10.26    3811.51   0.003    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.6150e+02  on 190  degrees of freedom
Residual deviance: 2.2644e-07  on 181  degrees of freedom
AIC: 20

Number of Fisher Scoring iterations: 25

Testing the model

predict
           3            6           11           17           20 
2.220446e-16 3.119060e-03 1.000000e+00 2.220446e-16 2.220446e-16 
          25           31           34           39           45 
1.000000e+00 2.220446e-16 1.000000e+00 1.000000e+00 1.000000e+00 
          48           53           59           62           67 
1.000000e+00 2.220446e-16 1.000000e+00 1.000000e+00 2.220446e-16 
          73           76           81           87           90 
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
          95          101          104          109          115 
2.220446e-16 2.220446e-16 1.000000e+00 1.000000e+00 2.220446e-16 
         118          123          129          132          137 
2.220446e-16 2.220446e-16 1.000000e+00 1.000000e+00 2.220446e-16 
         143          146          151          157          160 
2.220446e-16 1.000000e+00 1.000000e+00 1.000000e+00 2.220446e-16 
         165          171          174          179          185 
2.220446e-16 1.000000e+00 2.220446e-16 1.000000e+00 2.220446e-16 
         188          193          199          202          207 
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
         213          216          221          227          230 
1.000000e+00 2.220446e-16 2.220446e-16 2.220446e-16 1.000000e+00 
         235          241 
2.220446e-16 2.220446e-16 
predict
  3   6  11  17  20  25  31  34  39  45  48  53  59  62  67  73  76 
  0   0   1   0   0   1   0   1   1   1   1   0   1   1   0   1   1 
 81  87  90  95 101 104 109 115 118 123 129 132 137 143 146 151 157 
  1   1   1   0   0   1   1   0   0   0   1   1   0   0   1   1   1 
160 165 171 174 179 185 188 193 199 202 207 213 216 221 227 230 235 
  0   0   1   0   1   0   1   1   1   1   1   1   0   0   0   1   0 
241 
  0 
table(test_set$Classes,predict)
   predict
     0  1
  0 22  1
  1  1 28
print(paste('Accuracy =',1-misclassifications))
[1] "Accuracy = 0.961538461538462"

Plotting the ROC curve

[INTERPRETATION]

---
title: <center>Algerian Forest Fire Analysis</center><br/>
output: html_notebook
author:
  - Mohamed Rissal Hedna 201906233
  - Younes Djemmal 201906xxx
---
## Overview
### Background

During the summer of 2012, [wildfires](https://www.un-spider.org/news-and-events/news/algeria-maps-summer-2012-wildfires-available) ravaged northern Algeria, especially the coastal cities. The disaster was driven by higher-than-average temperatures, which reached as high as 50 degrees Celsius.

### Objectives 

One important safeguard against such disasters is the ability to predict their occurrence. In this project, we attempt to predict these forest fires based on multiple features related to weather indices.

### Dataset Description

The dataset we will use to train and test our models consists of 244 observations from two Algerian wilayas (provinces): Sidi-Bel Abbes and Bejaia. The observations were gathered over four months, from June to September 2012, for both regions.

**The Dataset contains the following variables:**

1. Date: (DD/MM/YYYY) Day, month ('june' to 'september'), year (2012)
2. Temperature: maximum temperature at noon, in degrees Celsius: 22 to 42
3. RH: Relative Humidity in %: 21 to 90
4. Ws: Wind speed in km/h: 6 to 29
5. Rain: total daily rainfall in mm: 0 to 16.8<br>
**FWI Components (check this [LINK](https://cwfis.cfs.nrcan.gc.ca/background/summary/fwi) for more information)**
6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
8. Drought Code (DC) index from the FWI system: 7 to 220.4
9. Initial Spread Index (ISI) index from the FWI system: 0 to 18.5
10. Build-up Index (BUI) index from the FWI system: 1.1 to 68
11. Fire Weather Index (FWI) Index: 0 to 31.1
12. Classes: two classes, namely "fire" and "not fire"

## Exploratory Analysis

We start by importing the libraries needed for our analysis:

- *dplyr*, *plyr*, and the *tidyverse*: data manipulation (joins, recoding, summarising).
- *vtable*: quick summary-statistics tables via *st()*.
- *ggplot2*, *ggcorrplot*, and *plotly*: static and interactive visualization.
- *mlbench* and *caret*: feature selection and model training.
- *caTools*: train/test splitting.
- *ROCR*: ROC curves and AUC computation.

```{r, include=FALSE}
library(dplyr)
library(vtable)
library(plyr)
library(ggplot2)
library(ggcorrplot)
library(plotly)
library(tidyverse)
#Feature selection libraries
library(mlbench)
library(caret)
#For Logistic regression
library(caTools)
#For ROC curve
library(ROCR)
```


### Importing the data 

The dataset provided to us was a single .csv file containing two tables: one for the observations from the Sidi-Bel Abbes region and one for Bejaia.

Before starting our analysis, we separated the tables into two files by region, named *Algerian_forest_fires_dataset_Bejaia.csv* and *Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv* for Bejaia and Sidi-Bel Abbes respectively.


```{r, echo=FALSE}
df_b <- read.csv("./Algerian_forest_fires_dataset_Bejaia.csv")
df_s <- read.csv("./Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv")
```

### Cleaning and processing the data

We first check for null values in both datasets; none were found.


```{r}
colSums(is.na(df_b))
colSums(is.na(df_s))
```

We then add a column to each dataset to indicate the region (wilaya). We chose the following encoding:

1. Bejaia = 0
2. Sidi-Bel Abbes = 1


```{r}
df_b[["Region"]] = 0
df_s[["Region"]] = 1
```

After that, we merge the two datasets into a single dataframe using *full_join()*, which will make the data easier to explore and analyze.

```{r}
# make sure DC and FWI are numeric so the column types
# match across the two tables before joining
df_s$DC <- as.double(df_s$DC)
df_s$FWI <- as.double(df_s$FWI)

df = full_join(df_s, df_b)

dim(df)
str(df)
```
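To illustrate how *full_join()* behaves here, a toy sketch with made-up tables (not our data): it matches on all shared column names and keeps unmatched rows from both sides, so stacking two region tables with identical columns yields the union of their rows.

```r
library(dplyr)

# Two toy tables with the same columns, like our two region files
a <- data.frame(id = c(1, 2), v = c("x", "y"))
b <- data.frame(id = c(3),    v = c("z"))

# full_join() joins on all shared columns (id and v) by default and
# keeps unmatched rows from both sides: we get the union of the rows
merged <- full_join(a, b)
nrow(merged)  # 3
```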

```{r}
summary(df)
unique(df$year)
unique(df$month)
```

We check again for any *NA* values that might have been introduced by merging the two tables, and find one row with NA values in DC and FWI. We drop that row, since losing a single observation will not affect the overall dataset.

```{r}
colSums(is.na(df))
df = df %>% drop_na(DC)
dim(df)
```

We now inspect the distinct values of the categorical variables, mainly the Classes and Region columns.


```{r}
unique(df$Classes)
unique(df$Region)
```

We find that the Classes column contains values with stray whitespace, so we trim those spaces.

```{r}
df$Classes <- trimws(df$Classes, which = c("both"))
```



```{r}
unique(df$Classes)
df = df %>% drop_na(Classes)
```

```{r}
df$Classes <- mapvalues(df$Classes, from=c("not fire","fire"), to=c(0,1))
```

```{r}
unique(df$Classes)
df$Classes <- as.numeric(df$Classes)
st(df)
```

Finally, we drop the constant *year* column (all observations are from 2012) and standardize the numeric predictors, leaving *day*, *month*, *Classes*, and *Region* unscaled.

```{r}
# drop the constant year column (all observations are from 2012)
df <- df[-c(3)]

# standardize the numeric predictors; leave day, month, Classes, and Region as-is
df_scaled = df
df_scaled[-c(1,2,13,14)] <- scale(df[-c(1,2,13,14)])
st(df_scaled)
```
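As a quick sanity check on what *scale()* does, here is a toy vector (not our data): each column is centered on its mean and divided by its standard deviation, so the result has mean 0 and standard deviation 1.

```r
# scale() standardizes a column: subtract the mean, divide by the sd
x <- c(2, 4, 6, 8)
s <- as.numeric(scale(x))

all.equal(s, (x - mean(x)) / sd(x))  # TRUE
mean(s)  # 0 (up to floating point)
sd(s)    # 1
```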

### Visualizing the data

We have ended up with a clean and scaled dataframe named *df_scaled*, which we will use to visualize and further explore our data.

Our first instinct is to compare the two regions in terms of number of fires and average temperature.

```{r}
aggregate(df$Classes ~ df$Region, FUN = sum)
aggregate(df$Temperature ~ df$Region, FUN = mean)
```
We use the unscaled dataset here so the plots show real-world temperature values.

```{r}
df %>%
  group_by(Region) %>%
  summarise(Number_of_fires = sum(Classes), Temperature = mean(Temperature)) %>%
  ggplot(aes(x=Region, y=Number_of_fires, fill = Temperature))+
  geom_col(position='dodge')
```

We can see that the Sidi-Bel Abbes region has both a greater total number of fires and a higher average temperature throughout the summer of 2012.

## Model Building
### Correlation Matrix

The previous results lead us to suspect a positive relationship between temperature and the likelihood of a fire. However, we need to investigate all the other variables as well, which is why we plot a correlation matrix of the features in the dataset.

```{r}
corr_mat <- round(cor(df_scaled), 2)

corr_plot <- ggcorrplot(
  corr_mat,
  hc.order = FALSE,
  type = "upper",
  outline.col = "white"
)

ggplotly(corr_plot)
```

### Feature Selection

We performed feature selection using the Caret package to determine which features are the most important and which are the least. 

In this case, we opted for Discriminant Analysis with stepwise feature selection, specifying *stepLDA* and *stepQDA* as our methods.

The *varImp* function returns a measure of importance out of 100 for each of the features. According to the official [Caret documentation](https://topepo.github.io/caret/variable-importance.html), the importance metric is calculated by conducting a ROC curve analysis on each predictor; a series of cutoffs is applied to the predictor data to predict the class. The AUC is then computed and is used as a measure of variable importance. 
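To make this importance metric concrete, here is a small sketch of our own (an illustration, not Caret's actual implementation) that computes the AUC of a single predictor via the rank-sum (Mann-Whitney) identity, on made-up toy values:

```r
# AUC of one predictor = P(score for a positive > score for a negative),
# computed from ranks (Mann-Whitney identity). Toy data, not our dataset.
auc_of <- function(x, y) {
  r  <- rank(x)            # rank() averages ranks over ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

x <- c(0.1, 0.4, 0.35, 0.8)   # hypothetical predictor values
y <- c(0,   0,   1,    1)     # class labels
auc_of(x, y)  # 0.75
```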


```{r}
# prepare training scheme
set.seed(7)
df_scaled$Classes = as.factor(df_scaled$Classes)

control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the models
modelLDA <- train(Classes~., data=df_scaled, method="stepLDA", trControl=control)
modelQDA <- train(Classes~., data=df_scaled, method="stepQDA", trControl=control)

importanceLDA <- varImp(modelLDA, scale=FALSE)
importanceQDA <- varImp(modelQDA, scale=FALSE)

plot(importanceLDA)
plot(importanceQDA)
```

We can see that the variables *month*, *Ws*, *Region*, and *day* contribute little compared to the other features, so we will disregard them in our model.

### Logistic Regression
#### Splitting the dataset

We first perform logistic regression on our dataset, beginning by splitting the data into train/test sets with an 80/20 split, a common default.

```{r}
split <- sample.split(df_scaled$Classes, SplitRatio=0.8)  # stratify on the class labels

train_set <- subset(df_scaled, split == TRUE)
test_set <- subset(df_scaled, split == FALSE)
head(train_set)
```
#### Training the model

We create our model with the features that were the most important during our feature selection step. Then, we fit the model to our training data.

After that, we evaluate the model on the test set, thresholding the predicted probabilities at 0.5. With our split, this yields only one false positive and one false negative, for an accuracy of about 96% on the test data.

```{r}
logistic_model <- glm(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, family="binomial")
summary(logistic_model)
```
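As a reminder of how a fitted logistic model turns its linear predictor into a probability, here is a toy calculation with hypothetical coefficients and feature values (not our fitted estimates):

```r
# p = 1 / (1 + exp(-eta)) = plogis(eta), where eta is the linear predictor
# Hypothetical intercept and coefficients, for illustration only
eta <- 0.5 + 1.2 * 1.0 - 0.8 * 0.3   # intercept + b1*x1 + b2*x2
p   <- plogis(eta)                    # ~0.81
p > 0.5                               # TRUE: classified as "fire" at threshold 0.5
```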

#### Testing the model

```{r}
predict <- predict(logistic_model, test_set, type="response")
predict
```

```{r}
predict <- ifelse(predict > 0.5, 1, 0)
predict
```

```{r}
table(test_set$Classes, predict)
```

```{r}
misclassifications <- mean(predict != test_set$Classes)

print(paste('Accuracy =',1-misclassifications))
```
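The accuracy can also be read straight off the confusion matrix. A sketch with illustrative counts consistent with the one-false-positive, one-false-negative result described above:

```r
# Illustrative 2x2 confusion matrix: rows = actual, cols = predicted
cm <- matrix(c(22,  1,
                1, 28), nrow = 2, byrow = TRUE)

accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
round(accuracy, 4)  # 0.9615
```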

#### Plotting the ROC curve

```{r}
# ROC analysis should use the predicted probabilities, not the
# thresholded 0/1 labels, so we recompute them on the test set
probs <- predict(logistic_model, test_set, type="response")

ROCPred <- prediction(probs, test_set$Classes)
ROCPer <- performance(ROCPred, measure="tpr", x.measure="fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
plot(ROCPer)
```

The ROC curve plots the true positive rate against the false positive rate across all classification thresholds. A curve hugging the top-left corner, with an AUC close to 1, indicates that the model separates fire from non-fire days well, whereas an AUC of 0.5 would correspond to random guessing.
